feat: cute dsl mmfp4 for blackwell #2540
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough
Adds "cute-dsl" as a new mm_fp4 FP4 GEMM backend: benchmarks and tests updated; mm_fp4 dispatch extended with CuTe DSL availability checks, a runner factory entry, and a kernel cache; and a new SM100 block-scaled persistent GEMM kernel was added.
Sequence Diagram

sequenceDiagram
participant User as User
participant MM as mm_fp4
participant Disp as Dispatcher
participant Req as Requirement (_cute_dsl_gemm_fp4_requirement)
participant Run as Runner (_cute_dsl_gemm_fp4_runner)
participant Cache as Kernel Cache
participant Kernel as CuTeDSL Kernel
User->>MM: mm_fp4(..., backend="cute-dsl", enable_pdl=...)
MM->>Disp: select backend runner
Disp->>Req: validate availability & constraints
alt invalid
Req-->>MM: raise/skip
else valid
Disp->>Run: create/obtain runner
Run->>Cache: lookup compiled kernel by config
alt cached
Cache-->>Run: return kernel
else not cached
Run->>Kernel: compile kernel
Kernel-->>Cache: store compiled kernel
Cache-->>Run: return kernel
end
Run-->>Disp: runner instance
Disp->>Kernel: execute kernel with tensors
Kernel-->>User: results
end
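A minimal sketch of the compile-on-miss caching pattern the diagram describes (illustrative names only, not the actual FlashInfer API):

_KERNEL_CACHE = {}  # maps a config tuple to a compiled kernel

def get_or_compile(config, compile_fn):
    # Return a cached kernel for this config, compiling and storing it on a miss.
    if config not in _KERNEL_CACHE:
        _KERNEL_CACHE[config] = compile_fn(config)
    return _KERNEL_CACHE[config]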
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Summary of Changes
Hello @nv-yunzheq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Code Review
This pull request integrates cute_dsl as a new backend for mm_fp4, which is a significant and valuable addition. The changes are well-structured, introducing new, high-performance kernels ported from NVIDIA's libraries and integrating them consistently with the existing backend infrastructure. The new code is complex but appears to be of high quality. I've identified one potential issue in the autotuner's tactic generation logic where an alignment check seems to be incorrect, which could lead to suboptimal kernel selection. Overall, this is an excellent contribution that should improve FP4 GEMM performance.
if swap_ab and not m_aligned:
    continue
The alignment check for the output matrix C when swap_ab is true appears to be incorrect. When swap_ab is true, the kernel computes B.T @ A.T, and the output is effectively a column-major matrix of shape (n, m). The contiguous dimension in memory is along the columns, which corresponds to the problem's n dimension. Therefore, the alignment check should be on n (n_aligned), not m (m_aligned). This incorrect pruning might exclude valid and potentially optimal kernel configurations.
- if swap_ab and not m_aligned:
-     continue
+ if swap_ab and not n_aligned:
+     continue
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@flashinfer/gemm/gemm_base.py`:
- Around line 3068-3173: The cache key used to index _CUTE_DSL_KERNEL_CACHE must
include the device identity to avoid reusing device-specific compiled kernels
across GPUs; modify the construction of cache_key (the tuple currently
containing sf_vec_size, mma_tiler_mn, cluster_shape_mn, swap_ab, use_prefetch,
kernel_type, use_tma_store, enable_pdl, out_dtype) to also incorporate the
executing device (derive from kernel_a.device — include device.type and
device.index (or a stable sentinel like -1 if index is None)), and use that
augmented cache_key when reading/writing _CUTE_DSL_KERNEL_CACHE for
compiled_gemm and max_active_clusters so the lookup/store around compiled_gemm
and max_active_clusters becomes device-aware.
- Around line 3176-3192: The kernel assumes row-major memory when swap_ab=True
but launch_out is set to the non-contiguous view out.T; change the launch path
so the kernel receives a contiguous buffer with the expected layout: when
swap_ab is True, allocate a temporary contiguous tensor with the row-major
layout (or call out.clone().contiguous()) into which the kernel will write (this
is the launch_out passed to the kernel), then after the kernel completes copy
the results back into the original out via the appropriate transpose (e.g.,
out.copy_(temp.T)) and free the temp; alternatively, ensure out is originally
allocated with the layout expected by cute.make_ordered_layout so no transpose
view is used. Ensure this change is applied around the launch_out assignment and
kernel invocation that uses swap_ab and interacts with cute.make_ordered_layout.
In `@flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py`:
- Around line 1655-1658: The docstring documents a non-existent parameter "sepi"
alongside tTR_rC, tidx, and sC; remove the stale "sepi (cute.Tensor):" entry
from the function's docstring (the block describing tTR_rC, tidx, sC) so the
parameter list matches the actual function signature and leave only real
parameters (e.g., tTR_rC, tidx, sC).
- Around line 2017-2021: Rename the helper function
check_contigous_16B_alignment to check_contiguous_16B_alignment and update all
call sites that invoke it (the three places currently calling
check_contigous_16B_alignment) to use the new name; ensure the function
signature (dtype, is_mode0_major, tensor_shape) remains unchanged so callers
still pass the same arguments and behavior is preserved.
🧹 Nitpick comments (4)
flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py (4)
648-648: Nit: prefix unused unpacked variables with `_`.
`bidy` and `bidz` are never referenced. Prefixing them with `_` silences linter warnings and signals intent.
Proposed fix
- bidx, bidy, bidz = cute.arch.block_idx()
+ bidx, _bidy, _bidz = cute.arch.block_idx()
1677-1719: Unused parameter `tidx`.
`tidx` is accepted but never referenced inside `epilog_gmem_copy_and_partition`. If it's kept for API symmetry with the other `epilog_*_copy_and_partition` methods, consider documenting that intent. Otherwise, remove it.
1721-1829: Unused parameters `a_major_mode` and `b_major_mode` in `_compute_stages`.
These are passed through from the caller but never referenced in the method body. They appear to be placeholders; consider removing them or adding a comment if they're reserved for future heuristics.
Also note: `num_ab_stage` (line 1816) has no lower-bound clamp. If shared memory is under-provisioned for a given configuration, this could yield ≤ 1 stages, breaking double-buffered pipeline semantics. The upstream `can_implement` validation likely guards against this in practice, but a defensive `max(num_ab_stage, 2)` would be safer.
1923-1947: Unused parameters `c_dtype` and `c_major` in `is_valid_layouts`.
These parameters are accepted but never checked. If all C layouts are valid, remove them from the signature (and update callers). If C-layout validation is planned, consider adding a `# TODO` to track it.
cache_key = (
    sf_vec_size,
    mma_tiler_mn,
    cluster_shape_mn,
    swap_ab,
    use_prefetch,
    kernel_type,
    use_tma_store,
    enable_pdl,
    out_dtype,
)

if cache_key not in _CUTE_DSL_KERNEL_CACHE:
    # Create kernel instance
    if kernel_type == "sm103" and Sm103Kernel is not None:
        gemm = Sm103Kernel(  # type: ignore[assignment]
            sf_vec_size,
            mma_tiler_mn,
            cluster_shape_mn,
            use_tma_store,
            enable_pdl,
        )
    else:
        gemm = Sm100BlockScaledPersistentDenseGemmKernel(  # type: ignore[assignment]
            sf_vec_size,
            mma_tiler_mn,
            cluster_shape_mn,
            use_prefetch,
            enable_pdl,
        )

    # Create CuTe pointers for compilation
    a_ptr = make_ptr(
        cutlass.Float4E2M1FN,
        kernel_a.data_ptr(),
        cute.AddressSpace.gmem,
        32,
    )
    b_ptr = make_ptr(
        cutlass.Float4E2M1FN,
        kernel_b.data_ptr(),
        cute.AddressSpace.gmem,
        32,
    )
    a_sf_ptr = make_ptr(
        cutlass.Float8E4M3FN,
        kernel_a_sf.data_ptr(),
        cute.AddressSpace.gmem,
        16,
    )
    b_sf_ptr = make_ptr(
        cutlass.Float8E4M3FN,
        kernel_b_sf.data_ptr(),
        cute.AddressSpace.gmem,
        16,
    )
    c_ptr = make_ptr(
        c_cutlass_dtype, out.data_ptr(), cute.AddressSpace.gmem, 16
    )

    # Alpha: ensure 1-dim shape [1] for consistent TVM FFI compilation
    if alpha_tensor is not None:
        alpha_compile = (
            alpha_tensor.reshape(1)
            if alpha_tensor.dim() != 1
            else alpha_tensor
        )
        alpha_cute = cute.runtime.from_dlpack(alpha_compile)
    else:
        alpha_cute = cute.runtime.from_dlpack(
            torch.tensor([1.0], dtype=torch.float32, device=a.device)
        )

    # Get max active clusters
    from flashinfer.cute_dsl.utils import get_max_active_clusters

    max_active_clusters = get_max_active_clusters(
        cluster_shape_mn[0] * cluster_shape_mn[1]
    )

    # Create a fake stream for compilation
    stream = cute.runtime.make_fake_stream(use_tvm_ffi_env_stream=True)

    compiled_gemm = cute.compile(
        gemm.wrapper,
        kernel_m,
        kernel_n,
        real_k,
        sf_m,
        sf_n,
        sf_k,
        batch_size,
        a_ptr,
        b_ptr,
        a_sf_ptr,
        b_sf_ptr,
        c_ptr,
        alpha_cute,
        max_active_clusters,
        stream,
        swap_ab,
        options="--opt-level 2 --enable-tvm-ffi",
    )

    _CUTE_DSL_KERNEL_CACHE[cache_key] = (compiled_gemm, max_active_clusters)
🧩 Analysis chain
(Verification scripts run against flashinfer/gemm/gemm_base.py and flashinfer/cute_dsl/utils.py, inspecting the _CUTE_DSL_KERNEL_CACHE usage, the runner factory, and get_max_active_clusters; script output omitted.)
Cache key must include device identity to prevent cross-device reuse in multi-GPU setups.
_CUTE_DSL_KERNEL_CACHE lacks device-specific information in its cache key. The current 9-tuple key includes kernel parameters but omits device identity; compiled GEMM binaries and max_active_clusters are device-specific. In multi-GPU or heterogeneous SM configurations, this allows incorrect reuse of kernels compiled for different devices. Include device type and index in the cache key:
🔧 Suggested fix (device-aware cache key)
# Cache key for compiled kernel
+ device_key = (a.device.type, a.device.index)
cache_key = (
+ device_key,
sf_vec_size,
mma_tiler_mn,
cluster_shape_mn,
swap_ab,
use_prefetch,
kernel_type,
use_tma_store,
enable_pdl,
out_dtype,
)
🤖 Prompt for AI Agents
In `@flashinfer/gemm/gemm_base.py` around lines 3068 - 3173, The cache key used to
index _CUTE_DSL_KERNEL_CACHE must include the device identity to avoid reusing
device-specific compiled kernels across GPUs; modify the construction of
cache_key (the tuple currently containing sf_vec_size, mma_tiler_mn,
cluster_shape_mn, swap_ab, use_prefetch, kernel_type, use_tma_store, enable_pdl,
out_dtype) to also incorporate the executing device (derive from kernel_a.device
— include device.type and device.index (or a stable sentinel like -1 if index is
None)), and use that augmented cache_key when reading/writing
_CUTE_DSL_KERNEL_CACHE for compiled_gemm and max_active_clusters so the
lookup/store around compiled_gemm and max_active_clusters becomes device-aware.
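For illustration, one way to build the device-aware portion of the key as the prompt suggests (the helper name is hypothetical; the real code would inline this where cache_key is constructed):

def _device_cache_key(tensor):
    # Use -1 as a stable sentinel when the device index is unset (e.g., plain "cuda").
    dev = tensor.device
    return (dev.type, dev.index if dev.index is not None else -1)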
# Handle output tensor for swap_ab
if swap_ab:
    launch_out = out.T
else:
    launch_out = out

# Prepare alpha: ensure it is always a 1-dim tensor with shape [1].
# mm_fp4 may pass alpha as a 0-dim scalar tensor or a 1-dim tensor;
# the TVM FFI compiled kernel requires a consistent 1-dim shape.
if alpha_tensor is None:
    alpha_for_launch = torch.tensor(
        [1.0], dtype=torch.float32, device=a.device
    )
elif alpha_tensor.dim() == 0:
    alpha_for_launch = alpha_tensor.unsqueeze(0)
else:
    alpha_for_launch = alpha_tensor.reshape(1)
🧩 Analysis chain
(Verification scripts run against flashinfer/gemm/gemm_base.py, flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py, and related kernel utilities, inspecting launch_out, the swap_ab handling, and how the output layout is constructed; script output omitted.)
Confirm kernel layout handling for transposed output. The kernel is JIT-compiled with swap_ab as a parameter and creates a tensor layout via cute.make_ordered_layout based on this flag. When swap_ab=True, it creates a row-major tensor layout (order=(0,1,2)) assuming strides match that layout. However, out.T is a non-contiguous view with different strides than row-major. Since the kernel receives only the base pointer (no stride metadata), it will interpret the memory using incorrect stride assumptions, causing incorrect writes. Either materialize the transposed output as a contiguous buffer and copy results back, or ensure the output tensor is allocated with the correct layout from the start.
🤖 Prompt for AI Agents
In `@flashinfer/gemm/gemm_base.py` around lines 3176 - 3192, The kernel assumes
row-major memory when swap_ab=True but launch_out is set to the non-contiguous
view out.T; change the launch path so the kernel receives a contiguous buffer
with the expected layout: when swap_ab is True, allocate a temporary contiguous
tensor with the row-major layout (or call out.clone().contiguous()) into which
the kernel will write (this is the launch_out passed to the kernel), then after
the kernel completes copy the results back into the original out via the
appropriate transpose (e.g., out.copy_(temp.T)) and free the temp;
alternatively, ensure out is originally allocated with the layout expected by
cute.make_ordered_layout so no transpose view is used. Ensure this change is
applied around the launch_out assignment and kernel invocation that uses swap_ab
and interacts with cute.make_ordered_layout.
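For illustration, a minimal sketch of the temporary-buffer approach described above (this is not the PR's code; it reuses the variable names from the excerpt and assumes a 2-D out):

if swap_ab:
    # Contiguous row-major buffer of shape (n, m), matching the layout the
    # kernel was compiled for when swap_ab=True.
    temp = torch.empty(out.T.shape, dtype=out.dtype, device=out.device)
    launch_out = temp
else:
    launch_out = out

# ... launch the compiled kernel, which writes into launch_out ...

if swap_ab:
    out.copy_(temp.T)  # fold the transposed result back into the caller's output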
def check_contigous_16B_alignment(dtype, is_mode0_major, tensor_shape):
    major_mode_idx = 0 if is_mode0_major else 1
    num_major_elements = tensor_shape[major_mode_idx]
    num_contiguous_elements = 16 * 8 // dtype.width
    return num_major_elements % num_contiguous_elements == 0
Typo: check_contigous_16B_alignment → check_contiguous_16B_alignment.
Minor typo in the inner helper name ("contigous" → "contiguous").
Proposed fix
- def check_contigous_16B_alignment(dtype, is_mode0_major, tensor_shape):
+ def check_contiguous_16B_alignment(dtype, is_mode0_major, tensor_shape):

Update the three call sites on lines 2024–2026 accordingly.
🤖 Prompt for AI Agents
In `@flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py` around lines 2017 -
2021, Rename the helper function check_contigous_16B_alignment to
check_contiguous_16B_alignment and update all call sites that invoke it (the
three places currently calling check_contigous_16B_alignment) to use the new
name; ensure the function signature (dtype, is_mode0_major, tensor_shape)
remains unchanged so callers still pass the same arguments and behavior is
preserved.
bkryu
left a comment
Thanks @nv-yunzheq, left a number of comments.
/bot run
bkryu
left a comment
Thanks for updating. No concerns on my end but will wait for a few more pairs of eyes before approving
Dismissing "request for change" as requested changes have been made
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@flashinfer/gemm/gemm_base.py`:
- Around line 3392-3397: Docstring inconsistency: replace the underscore form
`cute_dsl` with the exact backend literal `"cute-dsl"` in backticks wherever it
appears in the docstring for the enable_pdl parameter (and the other occurrence
noted around line 3402) so the documentation matches the actual backend name;
update the text referencing enable_pdl to read `\"cute-dsl\"` (in backticks) to
ensure consistent naming across the docstring for the enable_pdl parameter and
related descriptive lines.
In `@flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py`:
- Around line 1807-1824: The computed stage counts can become non-positive;
after calculating num_ab_stage and refining num_c_stage (using smem_capacity,
occupancy, mbar_helpers_bytes, c_bytes, ab_bytes_per_stage, c_bytes_per_stage),
clamp them to safe minima (e.g., num_ab_stage = max(1, num_ab_stage) and
num_c_stage = max(2, num_c_stage)) or raise a clear exception if the tile
configuration is invalid; do this just before the return in the function that
computes stages so the pipeline never receives <=0 stages and include a short
error message if you choose to raise.
🧹 Nitpick comments (6)
flashinfer/gemm/gemm_base.py (3)
3194-3197: Avoid allocating a new tensor on every forward call when `alpha` is `None`.
`torch.tensor([1.0], ...)` allocates a new CUDA tensor on every invocation. For a hot GEMM path, consider caching the default alpha once (e.g., as an instance attribute or a module-level constant per device).
♻️ Suggested approach
+ # Cache a default alpha=1.0 tensor to avoid per-call allocation
+ _default_alpha_cache = {}
+
  # Prepare alpha: ensure it is always a 1-dim tensor with shape [1].
  if alpha_tensor is None:
-     alpha_for_launch = torch.tensor(
-         [1.0], dtype=torch.float32, device=a.device
-     )
+     device = a.device
+     if device not in _default_alpha_cache:
+         _default_alpha_cache[device] = torch.tensor(
+             [1.0], dtype=torch.float32, device=device
+         )
+     alpha_for_launch = _default_alpha_cache[device]

You could place `_default_alpha_cache` as a class attribute on `CuteDSLFp4GemmRunner` or a closure variable in `_cute_dsl_gemm_fp4_runner`.
2939-2950: Hoist the `get_device_properties` call outside the loop.
`torch.cuda.get_device_properties(a.device).multi_processor_count` is called inside nested loops for each `use_prefetch=True` candidate. Move it before the loop to avoid repeated lookups.
♻️ Suggested change
Add before the `for mma_tiler_mn` loop (around line 2905):

sm_count = torch.cuda.get_device_properties(a.device).multi_processor_count

Then replace lines 2945-2947:

- sm_count = torch.cuda.get_device_properties(
-     a.device
- ).multi_processor_count
2808-2821: Noted: SM103 kernel disabled with clear TODO.
The commented-out SM103 import with the explanatory TODO and the explicit `Sm103Kernel = None` sentinel is clear. Consider tracking this with a GitHub issue so it doesn't get lost.
Would you like me to open an issue to track re-enabling the SM103 kernel once the cutlass-dsl package supports `SM103MmaMXF4Op`?
flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py (3)
1672-1678: `tidx` parameter is unused.
`tidx` is accepted but never referenced in the method body. If it's kept for API consistency with sibling `epilog_*` methods, consider prefixing with underscore (`_tidx`) to signal intent.
1918-1942: `c_dtype` and `c_major` parameters are unused.
These are accepted but never referenced in the validation logic. If they're placeholders for future constraints, consider adding a brief comment or prefixing with underscore.
1968-1968: Lambda assigned to a variable; prefer a `def` (Ruff E731).
Proposed fix
- _is_power_of_2 = lambda x: x > 0 and (x & (x - 1)) == 0
+ def _is_power_of_2(x):
+     return x > 0 and (x & (x - 1)) == 0
enable_pdl: bool
    Whether to enable Programmatic Dependent Launch (PDL) for the ``cute_dsl``
    backend, defaults to ``True``. PDL allows overlapping the tail of one kernel
    with the start of the next for reduced launch latency. This parameter is
    only used by the ``cute_dsl`` backend and is ignored by other backends.
Minor inconsistency: cute_dsl vs cute-dsl naming in docstring.
Line 3393 references cute_dsl (underscore) while the backend literal uses "cute-dsl" (hyphen). Consider using backticks with the exact backend string "cute-dsl" consistently to avoid user confusion.
📝 Suggested fix
enable_pdl: bool
- Whether to enable Programmatic Dependent Launch (PDL) for the ``cute_dsl``
+ Whether to enable Programmatic Dependent Launch (PDL) for the ``"cute-dsl"``
backend, defaults to ``True``. PDL allows overlapping the tail of one kernel
with the start of the next for reduced launch latency. This parameter is
- only used by the ``cute_dsl`` backend and is ignored by other backends.
+ only used by the ``"cute-dsl"`` backend and is ignored by other backends.Also on line 3402:
- When cute_dsl backend is used, both a and b should be quantized with nvfp4_quantize using the 128x4 scale factor layout and do_shuffle=False (same as cudnn/cutlass).
+ When ``"cute-dsl"`` backend is used, both a and b should be quantized with nvfp4_quantize using the 128x4 scale factor layout and do_shuffle=False (same as cudnn/cutlass).📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- enable_pdl: bool
-     Whether to enable Programmatic Dependent Launch (PDL) for the ``cute_dsl``
-     backend, defaults to ``True``. PDL allows overlapping the tail of one kernel
-     with the start of the next for reduced launch latency. This parameter is
-     only used by the ``cute_dsl`` backend and is ignored by other backends.
+ enable_pdl: bool
+     Whether to enable Programmatic Dependent Launch (PDL) for the ``"cute-dsl"``
+     backend, defaults to ``True``. PDL allows overlapping the tail of one kernel
+     with the start of the next for reduced launch latency. This parameter is
+     only used by the ``"cute-dsl"`` backend and is ignored by other backends.
🤖 Prompt for AI Agents
In `@flashinfer/gemm/gemm_base.py` around lines 3392 - 3397, Docstring
inconsistency: replace the underscore form `cute_dsl` with the exact backend
literal `"cute-dsl"` in backticks wherever it appears in the docstring for the
enable_pdl parameter (and the other occurrence noted around line 3402) so the
documentation matches the actual backend name; update the text referencing
enable_pdl to read `\"cute-dsl\"` (in backticks) to ensure consistent naming
across the docstring for the enable_pdl parameter and related descriptive lines.
# Calculate A/B/SFA/SFB stages:
# Start with total smem per CTA (capacity / occupancy)
# Subtract reserved bytes and initial C stages bytes
# Divide remaining by bytes needed per A/B/SFA/SFB stage
num_ab_stage = (
    smem_capacity // occupancy - (mbar_helpers_bytes + c_bytes)
) // ab_bytes_per_stage

# Refine epilogue stages:
# Calculate remaining smem after allocating for A/B/SFA/SFB stages and reserved bytes
# Add remaining unused smem to epilogue
num_c_stage += (
    smem_capacity
    - occupancy * ab_bytes_per_stage * num_ab_stage
    - occupancy * (mbar_helpers_bytes + c_bytes)
) // (occupancy * c_bytes_per_stage)

return num_acc_stage, num_ab_stage, num_c_stage
No lower-bound guard on computed stage counts.
If smem_capacity / occupancy is too small for the chosen tile configuration, num_ab_stage (Line 1812) could compute to ≤ 0, and the C-stage refinement (Line 1818) could reduce num_c_stage below the initial value of 2. Both would lead to invalid pipeline configurations at runtime.
Consider adding a minimum-stage assertion or early-return:
Proposed guard
num_ab_stage = (
smem_capacity // occupancy - (mbar_helpers_bytes + c_bytes)
) // ab_bytes_per_stage
+ assert num_ab_stage >= 2, (
+ f"Not enough shared memory for at least 2 A/B stages "
+ f"(got {num_ab_stage}). Consider reducing tile size or cluster shape."
+ )
# Refine epilogue stages:
🤖 Prompt for AI Agents
In `@flashinfer/gemm/kernels/dense_blockscaled_gemm_sm100.py` around lines 1807 -
1824, The computed stage counts can become non-positive; after calculating
num_ab_stage and refining num_c_stage (using smem_capacity, occupancy,
mbar_helpers_bytes, c_bytes, ab_bytes_per_stage, c_bytes_per_stage), clamp them
to safe minima (e.g., num_ab_stage = max(1, num_ab_stage) and num_c_stage =
max(2, num_c_stage)) or raise a clear exception if the tile configuration is
invalid; do this just before the return in the function that computes stages so
the pipeline never receives <=0 stages and include a short error message if you
choose to raise.
[FAILED] Pipeline #43830201: 16/20 passed
-----
When cudnn/cutlass backend is used, both a and b should quantized with nvfp4_quantize using the 128x4 scale factor layout and do_shuffle=False.
When trtllm backend is used, b must be quantized with 128x4 layout and `do_shuffle=True`. a can be quantized with either 128x4 or 8x4 layout (controlled by `use_8x4_sf_layout`) and `do_shuffle=False`.
When cute_dsl backend is used, both a and b should be quantized with nvfp4_quantize using the 128x4 scale factor layout and do_shuffle=False (same as cudnn/cutlass).
Hi @nv-yunzheq, I have a quick question, I am just wondering if the scale layout is the same, is there any reason for not using this for auto-tuner when backend=auto? Or can it still be considered experimental for now
Hi @b8zhong, yes it is meant to be experimental at least for this PR; nothing inherently blocking the cute-dsl backend from being considered for autotuning when backend=auto. We'd like to get the kernel in first knowing that it has satisfactory perf, and then perform a more detailed perf study.
We may include the cute-dsl backend as a config that can be autotuned in a followup PR.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/gemm/gemm_base.py`:
- Around line 2957-2962: The new parameters backend, use_nvfp4, and enable_pdl
are triggering Ruff ARG001/ARG002 unused-argument warnings; to silence them,
explicitly consume or acknowledge the arguments in each function that declares
them (e.g., add a line like "_ = backend, use_nvfp4, enable_pdl" near the top of
the function) or append a per-function noqa (e.g., "# noqa: ARG001") to the def
line; apply this change consistently for every function that declares these
parameters (the overloads/definitions using backend, use_nvfp4, enable_pdl in
this file).
---
Duplicate comments:
In `@flashinfer/gemm/gemm_base.py`:
- Around line 3564-3567: The non-contiguous view created by launch_out = out.T
when swap_ab is True can break kernel assumptions; change this to produce a
contiguous tensor (e.g., launch_out = out.T.contiguous() or launch_out =
out.transpose(0,1).contiguous()) so launch_out is contiguous in memory before
passing to the kernel; update the swap_ab branch where launch_out, swap_ab, and
out.T are used to ensure the contiguous output is supplied.
- Around line 3444-3455: The cache key tuple named cache_key (constructed from
sf_vec_size, mma_tiler_mn, cluster_shape_mn, swap_ab, use_prefetch, kernel_type,
use_tma_store, enable_pdl, out_dtype) is missing any device identity and can
incorrectly reuse kernels across GPUs; update the cache_key to include a
device-unique identifier (e.g., the CUDA device ordinal or a stable device
identifier such as PCI bus id / device UUID or
torch.cuda.get_device_properties(device).name+index) so compiled kernels are
cached per-device. Ensure you retrieve the current device from the same context
where kernels are compiled and append that identifier to the cache_key tuple.
backend: Literal[
    "cudnn", "trtllm", "cutlass", "cute-dsl", "auto"
] = "auto",  # unused
use_nvfp4: bool = True,
enable_pdl: bool = True,  # unused
):
Silence Ruff unused-argument warnings for new backend/PDL parameters.
Ruff flags these as unused (ARG001/ARG002). If lint is enforced, consider explicitly consuming them (e.g., _ = backend, enable_pdl) or adding a # noqa: ARG001 on the def line.
💡 Example pattern
def _check_mm_fp4_problem_size(..., backend=..., use_nvfp4=True, enable_pdl=True):
+     _ = backend, enable_pdl

Also applies to: 3017-3022, 3081-3086, 3108-3113, 3121-3153, 3612-3614
🧰 Tools
🪛 Ruff (0.15.1)
[warning] 2957-2957: Unused function argument: backend
(ARG001)
[warning] 2961-2961: Unused function argument: enable_pdl
(ARG001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/gemm/gemm_base.py` around lines 2957 - 2962, The new parameters
backend, use_nvfp4, and enable_pdl are triggering Ruff ARG001/ARG002
unused-argument warnings; to silence them, explicitly consume or acknowledge the
arguments in each function that declares them (e.g., add a line like "_ =
backend, use_nvfp4, enable_pdl" near the top of the function) or append a
per-function noqa (e.g., "# noqa: ARG001") to the def line; apply this change
consistently for every function that declares these parameters (the
overloads/definitions using backend, use_nvfp4, enable_pdl in this file).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@flashinfer/gemm/gemm_base.py`:
- Around line 3818-3829: The docstring uses the backend name cute_dsl
inconsistently with the project's canonical backend id "cute-dsl"; update the
occurrences in this docstring (the Notes block and the enable_pdl description)
to use "cute-dsl" (including quotes where other backend names are quoted) so the
naming matches other docs and earlier comments referencing the cute-dsl backend
and the enable_pdl parameter.
- Around line 3458-3576: The cache key built for _CUTE_DSL_MM_FP4_KERNEL_CACHE
(variable cache_key) is missing device identity, causing cross-device kernel
reuse; update the cache_key creation in gemm_base.py to include the current CUDA
device identifier (e.g., torch.cuda.current_device() or equivalent from the
cute/runtime/stream) so compiled_gemm and max_active_clusters are cached
per-device; ensure the same device id is used when looking up and storing
entries in _CUTE_DSL_MM_FP4_KERNEL_CACHE (refer to symbols cache_key,
_CUTE_DSL_MM_FP4_KERNEL_CACHE, compiled_gemm, max_active_clusters).
- Around line 3578-3583: The swap_ab branch assigns a non-contiguous view via
out.T to launch_out which can break downstream kernels; instead ensure
launch_out is a contiguous transposed tensor by replacing the out.T usage with
an explicit transpose followed by making it contiguous (e.g., use
out.transpose(...).contiguous() or out.t().contiguous()) so that launch_out is
contiguous when swap_ab is true; update the block that sets launch_out (the
swap_ab conditional around launch_out and out) accordingly.
- Around line 2961-3151: The Ruff warnings come from unused parameters (backend,
enable_pdl, and similar) introduced in the FP4 requirement helpers; to silence
them, explicitly mark those parameters as deliberately unused by either renaming
to a leading-underscore variant or adding a single-line discard (e.g., del
backend, enable_pdl) at the top of each affected function; apply this change in
_check_mm_fp4_problem_size, _cudnn_gemm_fp4_requirement,
_trtllm_gemm_fp4_requirement, _cutlass_gemm_fp4_requirement, and
_cute_dsl_gemm_fp4_requirement so Ruff no longer reports unused-argument
warnings while keeping the API unchanged.
@flashinfer-bot rerun failed

@flashinfer-bot stop

@flashinfer-bot rerun failed

/bot run

[FAILED] Pipeline #44404621: 9/20 passed
| "cudnn", "trtllm", "cutlass", "cute-dsl", "auto" | ||
| ] = "auto", # unused | ||
| use_nvfp4: bool = True, | ||
| enable_pdl: bool = True, # unused |
Is this unused? If so, why has it been added?
This is the function that checks whether the given operation is runnable. It has to have the exact same arguments as the mm_fp4 function itself. However, some of the parameters are not used in this support-check function.
| "cudnn", "trtllm", "cutlass", "cute-dsl", "auto" | ||
| ] = "auto", # unused | ||
| use_nvfp4: bool = True, | ||
| enable_pdl: bool = True, # unused |
Why are all these arguments marked as # unused?
As mentioned above, it needs to have the same function signature as mm_fp4. However, when checking whether the cute_dsl backend is viable, we don't need any of these input parameters to determine if it's runnable; we only check whether cute_dsl is installed or not.
The # unused marker is for the pre-commit check. The pre-commit reformatting would reject a function with unused parameters. We mark them to suppress that behavior.
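For illustration, a minimal sketch of the pattern described in this exchange (all names and parameters here are placeholders, not the real mm_fp4 signature):

def _cute_dsl_requirement_sketch(a, b, out_dtype=None, backend="auto",
                                 use_nvfp4=True, enable_pdl=True):
    # Parameters are kept only for signature parity with the public API and are
    # intentionally unused; the check consults backend availability alone.
    _ = (a, b, out_dtype, use_nvfp4, enable_pdl)
    return backend == "cute-dsl" and cute_dsl_is_installed()  # hypothetical availability helper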
📌 Description
Issue #2466
The PR integrates cute_dsl as a new backend for mm_fp4.
dense_blockscaled_gemm_sm100.py comes from dense_blockscaled_gemm_persistent.py from TensorRT-LLM.
dense_blockscaled_gemm_sm103.py comes from sm103_dense_blockscaled_gemm_persistent.py from CUTLASS. This file is integrated, but is not currently being used as it requires a pre-released version of nvidia-cutlass-dsl.
gemm_base.py contains the main wrapper logic for the mm_fp4 cute dsl gemm kernel.
Also update the mm_fp4 unit test and benchmark script to test the cute_dsl backend.
The performance data:
MMFP4 Benchmark Results
GB200 Non-Autotune
GB200 Autotune
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed pre-commit by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features
Tests